- Home
- Search Results
- Page 1 of 1
Search for: All records
-
Total Resources2
- Resource Type
-
0002000000000000
- More
- Availability
-
02
- Author / Contributor
- Filter by Author / Creator
-
-
Ali, Asmer Hamid (2)
-
Fan, Deliang (2)
-
Chen, Yiran (1)
-
Guo, Cheng (1)
-
Hwang, William (1)
-
Sridharan, Amitesh (1)
-
Tsai, Wilman (1)
-
X_Wang, Shan (1)
-
Yang, Li (1)
-
Zhang, Fan (1)
-
Zhang, Jeff (1)
-
#Tyler Phillips, Kenneth E. (0)
-
#Willis, Ciara (0)
-
& Abreu-Ramos, E. D. (0)
-
& Abramson, C. I. (0)
-
& Abreu-Ramos, E. D. (0)
-
& Adams, S.G. (0)
-
& Ahmed, K. (0)
-
& Ahmed, Khadija. (0)
-
& Aina, D.K. Jr. (0)
-
- Filter by Editor
-
-
& Spizer, S. M. (0)
-
& . Spizer, S. (0)
-
& Ahn, J. (0)
-
& Bateiha, S. (0)
-
& Bosch, N. (0)
-
& Brennan K. (0)
-
& Brennan, K. (0)
-
& Chen, B. (0)
-
& Chen, Bodong (0)
-
& Drown, S. (0)
-
& Ferretti, F. (0)
-
& Higgins, A. (0)
-
& J. Peters (0)
-
& Kali, Y. (0)
-
& Ruiz-Arias, P.M. (0)
-
& S. Spitzer (0)
-
& Sahin. I. (0)
-
& Spitzer, S. (0)
-
& Spitzer, S.M. (0)
-
(submitted - in Review for IEEE ICASSP-2024) (0)
-
-
Have feedback or suggestions for a way to improve these results?
!
Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher.
Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?
Some links on this page may take you to non-federal websites. Their policies may differ from this site.
-
With the rapid advancement of DNNs, numerous Process-in-Memory (PIM) architectures based on various memory technologies (Non-Volatile (NVM)/Volatile Memory) have been developed to accelerate AI workloads. Magnetic Random Access Memory (MRAM) is highly promising among NVMs due to its zero standby leakage, fast write/read speeds, CMOS compatibility, and high memory density. However, existing MRAM technologies such as spin-transfer torque MRAM (STT-MRAM) and spin-orbit torque MRAM (SOT-MRAM), have inherent limitations. STT-MRAM faces high write current requirements, while SOT-MRAM introduces significant area overhead due to additional access transistors. The new STT-assisted-SOT (SAS) MRAM provides an area-efficient alternative by sharing one write access transistor for multiple magnetic tunnel junctions (MTJs). This work presents the first fully digital processing-in-SAS-MRAM system to enable 8-bit floating-point (FP8) neural network inference with an application in on-device session-based recommender system. A SAS-MRAM device prototype is fabricated with 4 MTJs sharing the same SOT metal line. The proposed SAS-MRAM-based PIM macro is designed in TSMC 28nm technology. It achieves 15.31 TOPS/W energy efficiency and 269 GOPS performance for FP8 operations at 700 MHz. Compared to state-of-the-art recommender systems for the same popular YooChoose dataset, it demonstrates a 86 ×, 1.8 ×, and 1.12 × higher energy efficiency than that of GPU, SRAM-PIM, and ReRAM-PIM, respectively.more » « lessFree, publicly-accessible full text available June 29, 2026
-
Ali, Asmer Hamid; Zhang, Fan; Yang, Li; Fan, Deliang (, ACM)Nowadays, parameter-efficient fine-tuning (PEFT) large pre-trained models (LPMs) for downstream task have gained significant popularity, since it could significantly minimize the training computational overhead. The representative work, LoRA [1], learns a low-rank adaptor for a new downstream task, rather than fine-tuning the whole backbone model. However, for inference, the large size of the learned model remains unchanged, leading to in-efficient inference computation. To mitigate this, in this work, we are the first to propose a learning-to-prune methodology specially designed for fine-tuning downstream tasks based on LPMs with low-rank adaptation. Unlike prior low-rank adaptation approaches that only learn the low-rank adaptors for downstream tasks, our method further leverages the Gumbel-Sigmoid tricks to learn a set of trainable binary channel-wise masks that automatically prune the backbone LPMs. Therefore, our method could leverage the benefits of low-rank adaptation to reduce the training parameters size and smaller pruned backbone LPM size for efficient inference computation. Extensive experiments show that the Pruned-RoBbase model with our method achieves an average channel-wise structured pruning ratio of 24.5% across the popular GLUE Benchmark, coupled with an average of 18% inference time speed-up in real NVIDIA A5000 GPU. The Pruned-DistilBERT shows an average of 13% inference time improvement with 17% sparsity. The Pruned-LLaMA-7B model achieves up to 18.2% inference time improvement with 24.5% sparsity, demonstrating the effectiveness of our learnable pruning approach across different models and tasks.more » « lessFree, publicly-accessible full text available January 20, 2026
An official website of the United States government
